hello, world!

Get packages:

# General packages for stuff
library(tidyverse)
library(here)
library(janitor)
library(plotly)

# Packages for spatial stuff & point pattern analysis
library(tmap)
library(sf)
library(spatstat)
library(maptools)
library(sp)
library(raster)

# Packages for cluster analysis:
library(NbClust)
library(cluster)
library(factoextra)
library(dendextend)
library(ggdendro)

Get data:

voles <- read_sf(dsn = here("data","redtreevoledata"), 
                 layer = "ds033") %>% 
  dplyr::select(COUNTY) %>% 
  filter(COUNTY == "HUM") %>% 
  st_transform(crs = 4326)

st_crs(voles) #check projection
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
# Get Humboldt County outline
humboldt <- read_sf(dsn = here("data","redtreevoledata"), 
                    layer = "california_county_shape_file") %>% 
  filter(NAME == "Humboldt") %>% 
  dplyr::select(NAME)

st_crs(humboldt) <- 4326

Initial visualization of the data

plot(voles)

plot(humboldt)

# Plot them together with tmap:
tm_shape(humboldt) +
  tm_fill() +
  tm_shape(voles) +
  tm_dots(size = 0.2)

# Or with ggplot2: 
ggplot() +
  geom_sf(data = humboldt) +
  geom_sf(data = voles)

# Geocomputation in R by Robin Lovelace, free and online

Convert vole events and humboldt polygon to point pattern + window

We want to explore point patterns in a few different ways. Quadrat analysis, nearest neighbor analysis, etc. to compare with.

First we need to convert to ‘ppp’ and ‘owin’ - the points and windows, as used by maptools and spatstat (because sf is still catching up for raster and point pattern analysis stuff)…

## UPDATE!
voles_sp <- as(voles,"Spatial") #ensure R recognizes the voles data set as spatial data. Some functions dont like sf data
# voles_ppp <- as(voles_sp, "ppp") #projection error with package, allison will update later

Intro to cluster analysis (k-means, hierarchical)

k-means (partition-based)

Part 1. K-means clustering:

# Clean data
iris_nice <- iris %>% 
  clean_names()

# Explore data
ggplot(iris_nice) +
  geom_point(aes(x = petal_length, y = petal_width, color = species))

# we think it looks like there are 3 clusters

# Ask R: how many clusters do U think there should be for this dataset?
number_est <- NbClust(iris_nice[1:4], #the brackets tells R to only conisder data in columns 1-4
                      min.nc = 2,
                      max.nc = 10,
                      method = "kmeans") #use ?NbClust to see all the diff methods that can be used

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 11 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

graph output = count of how many algorithms thought the data had x number of clusters. Here the algorithms think 2 groups is most likley, but R is NO substitute for assessing the data visually yourself and we think that since there are 3 species 3 clusters make sense

By these estimators, 2 is the best number of clusters…but should that change our mind? Maybe…

What if we consider similarities across all four variables?

# Run kmeans
iris_km <- kmeans(iris_nice[1:4], 3) # kmeans specifying 3 groups!

iris_km$size
## [1] 33 21 96
iris_km$centers
##   sepal_length sepal_width petal_length petal_width
## 1     5.175758    3.624242     1.472727   0.2727273
## 2     4.738095    2.904762     1.790476   0.3523810
## 3     6.314583    2.895833     4.973958   1.7031250
iris_km$cluster # use to determine what cluster R thinks each observation belongs to
##   [1] 1 2 2 2 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 2 1 1
##  [38] 1 2 1 1 2 2 1 1 2 1 2 1 1 3 3 3 3 3 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3

size outputs have to add up to 100

# Bind the cluster number to the original data
iris_cl <- data.frame(iris_nice, cluster_no = factor(iris_km$cluster)) #can also use as.factor

# Plot
ggplot(iris_cl) +
  geom_point(aes(x = sepal_length, y = sepal_width, color = cluster_no))

A little better…

ggplot(iris_cl) +
  geom_point(aes(x = petal_length, 
                 y = petal_width, 
                 color = cluster_no, 
                 pch = species)) +
  scale_color_brewer(palette = "Set2")

# Or, a 3D plot with plotly
plot_ly(x = iris_cl$petal_length, 
        y = iris_cl$petal_width, 
        z = iris_cl$sepal_width, 
        type = "scatter3d", 
        color = iris_cl$cluster_no, 
        symbol = ~iris_cl$species,
        marker = list(size = 3),
        colors = "Set1")

Part 2. Cluster analysis: hierarchical

Hierarchical cluster analysis (dendrograms) in R

Relevant functions:

  • stats::hclust() - agglomerative hierarchical clustering
  • cluster::diana() - divisive hierarchical clustering

We’ll be using WorldBank environmental data (simplified), wb_env.csv

# Get the data
wb_env <- read_csv(here("data", "wb_env.csv"))

# Only keep top 20 greenhouse gas emitters (for simplifying visualization here...)
wb_ghg_20 <- wb_env %>% 
  arrange(-ghg) %>% 
  head(20)

# Scale it so all variables are on the same scale (can consider this for k-means clustering, too...)
wb_scaled <- as.data.frame(scale(wb_ghg_20[3:7]))

rownames(wb_scaled) <- wb_ghg_20$name # Add back in the rownames (country name) to scaled dataset

Find (Euclidean) distances (compute dissimilarity values/create matrix):

diss <- dist(wb_scaled, method = "euclidean", upper=TRUE)

# Use euclidean distances to do some complete agglomerate clustering [Hierarchical clustering (complete linkage)]
hc_complete <- hclust(diss, method = "complete" )

# plot outputs
plot(hc_complete, cex = 0.6, hang = -1)

# plot the same thing in ggplot
ggdendrogram(hc_complete, rotate = T) +
  theme_classic() +
  labs(x = "Country")

things clustered closer on the dendeogram = closer relation in multivariate space, countries further apart = more dissimilar